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Abstract 

This paper addresses the problem of identifying a lower dimensional space where observed 
data can be sparsely represented. This under-complete dictionary learning task can be formulated 
as a blind separation problem of sparse sources linearly mixed with an unknown orthogonal mixing 
matrix. This issue is formulated in a Bayesian framework. First, the unknown sparse sources are 
modeled as Bernoulli-Gaussian processes. To promote sparsity, a weighted mixture of an atom 
at zero and a Gaussian distribution is proposed as prior distribution for the unobserved sources. 
A non-informative prior distribution defined on an appropriate Stiefel manifold is elected for the 
mixing matrix. The Bayesian inference on the unknown parameters is conducted using a Markov 
chain Monte Carlo (MCMC) method. A partially collapsed Gibbs sampler is designed to generate 
samples asymptotically distributed according to the joint posterior distribution of the unknown model 
parameters and hyperparameters. These samples are then used to approximate the joint maximum 
a posteriori estimator of the sources and mixing matrix. Simulations conducted on synthetic data 
are reported to illustrate the performance of the method for recovering sparse representations. An 
application to sparse coding on under-complete dictionary is finally investigated. 

Index Terms 

Sparse representation, dictionary learning, Bayesian inference, Markov chain Monte Carlo 
(MCMC) methods. 
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I. Introduction 

In recent years, sparse representations have motivated much research in the signal process- 
ing community. This issue consists of identifying a sparse decomposition of a signal on a 
given dictionary. Among the main motivations, such representations have been demonstrated 
to be an efficient alternative for regularizing ill-posed inverse problems [[T|. More recently, 
compressive sensing has extensively benefited from sparsity to reconstruct a signal from a 
few projections [|2|, [|3|. Signal reconstruction under hard sparse constraints can be mainly 
formulated as an optimization problem of a £o-penalized quadratic criterion, whose numerical 
resolution is unfortunately an NP-complete problem. Several greedy algorithms have been 
proposed to approximate the signal reconstruction solutions, such as the well-known matching 
pursuit (MP) [4] and orthogonal matching pursuit (OMP) [5] algorithms. However, under ap- 
propriate sufficient conditions, replacing the £o-norm by the £i-norm in the penalization term 
can lead to the same unique solution [j6|. Therefore, exploiting these interesting sparseness 
properties, extensive works have been devoted on £i -constrained estimation problems for 
sparse representation (see for example ||7| and |[8|). 

In all the above works, the (generally over-complete) dictionary on which the signal is 
sparsely decomposed is assumed to be a priori known. The joint estimation of the atoms of 
the dictionary and the corresponding sparse representation is a much more challenging task. 
In Aharon et al. introduced an MP-based iterative method for designing over-complete 
dictionaries. Of course, the over-completeness allows redundancy in the atom decomposition. 
We address here the problem of recovering a sparse data representation in a lower-dimensional 
space defined by an under-complete orthogonal dictionary. Some up-to-date research activities 
conducted in the signal processing and machine learning communities have been focusing on 



this still open problem. Specifically, Mishali and Eldar have introduced in [10| an alternating 
minimization procedure to solve a complete sparse representation problem when the sparsity 
level is assumed to be known. In pT[ , [12|, and more recently in [13| the decomposition 
of a covariance matrix into sparse factors has been formulated as a regression problem 
with sparsity constraints. More generally, these matrix factorization strategies under some 
particular constraints, e.g., non-negativity, orthogonality and sparsity have demonstrated great 
interest for many different applications. These applications include representation of natural 
images p4| and gene expression data analysis p3| . 

In this paper, the under-complete dictionary learning task is formulated as a blind source 
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separation problem with sparsity constraints. Many applications have encouraged research on 



sparse signal and image deconvolution. These applications include astronomy [ 16 1, geophysics 



1 17 1, audio signal decomposition [18| and, more recently, molecular imaging [19|. In the 
present work, we propose a hierarchical Bayesian model for blind separation of sparse 
sources linearly mixed by an orthogonal matri}(Q This model is based on the choice of 
pertinent prior distributions for unknown parameters and hyperparameters. Following the 



works of Kormylo and Mendel [20|, the unknown sources are assumed to be Bernoulli- 
Gaussian (BG) processes. Therefore, the source prior is composed of a weighted mixture 
of a standard Gaussian distribution and a mass at zero. Note that this distribution has been 



widely advocated to solve reconstruction problems in a Bayesian framework (see [21|-[24| 
among others). However, estimating hyperparameters involved in such prior mixture is a 
critical issue that drastically impacts the estimation performance. As an example, the empirical 



Bayes (EB) and Stein unbiased risk (SURE) approach proposed in [25 1 experienced instability 
especially at high signal-to-noise ratios (SNR). In the adopted Bayesian estimation framework, 
several strategies are available to efficiently estimate these hyperparameters in an unsupervised 
manner. Lavielle et al. proposed to couple Markov chain Monte Carlo (MCMC) methods to a 
(stochastic) expectation-maximization (EM) algorithm p6| , p7| . A popular alternative to this 
hybrid strategy consists of introducing a second level of hierarchy in the Bayesian model by 
assigning non-informative prior distributions to the unknown hyperparameters [28, p. 383]. 
The joint posterior distribution of the unknown model parameters and hyperparameter is then 
approximated from samples generated by MCMC methods. This fully Bayesian estimation 
technique, followed in this paper, has been recently applied to signal segmentation [[29| and 



hyperspectral imaging [30|, [31 1 



Besides, standard MCMC methods have shown some limitations for deconvolving BG 



processes. More precisely, as noticed in [16| and [32 1, a standard Gibbs sampler can be 



stucked in a particular configuration of the BG process to be recovered, leading to poor 
mixing properties. Ge and Idler recently demonstrated that this BG deconvolution can be 
easily improved by marginalizing over the amplitudes of the non-zero components fy2\. The 
resulting MCMC scheme is a partially collapsed Gibbs sampler deeply studied by van Dyk 



et al. in [33] and [34[. Following this approach, we propose in this paper to take advantage 



'in the following, the mixing matrix is said orthogonal although it is not a square matrix. This abuse of language will 
mean that its columns, i.e., the dictionary atoms, are orthogonal. 
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of this MCMC strategy to estimate the sparse sources efficiently. 

To avoid redundant atoms in the dictionary and, more generally, ensure a full rank mixing 
matrix, we address the blind source separation problem under orthogonality constraint on the 
mixing matrix. The main motivation for imposing this orthogonality constraint on the mixing 
matrix is to capture more diversity among the recovered atoms belonging to the under- 
complete dictionary to be estimated. Only a few works in the signal processing literature 
have considered this additional property. Preliminary results on this issue have been reported 
in that has addressed the problem of sparse source separation from orthogonal mixtures. 
However, the strategy was based on a strong hypothesis for the source and mixing matrix, i.e., 
the prior knowledge of the sparsity level shared by all the sources. Hoff recently proposed 



in [35 1 a Bayesian formulation of the dimension reduction operators. More precisely, to 
estimate the rank of an unobserved matrix X involved in a noisy model, he has derived a 
Bayesian description of the singular value decomposition (SVD). The idea is to decompose the 
unobserved noise-free data X as X = UDV^ where U and V are matrices with orthogonal 
columns. The Bayesian inference on U and V has been finally conducted after assigning 
uniform prior distributions for U and V on their definition space, called the Stiefel manifold. 
This choice, coupled with the Gaussian properties of the noise, leads to von Mises-Fisher 
conditional posterior distributions for the columns of the matrices U and V. In the Bayesian 
orthogonal component analysis (BOCA) studied in this paper, a similar strategy is adopted 
by assigning a uniform distribution on the Stiefel manifold manifold to the mixing matrix. 
The resulting MCMC algorithm generates mixing matrix samples distributed according to 



the posterior distribution following the efficient scheme developed in [35 1. 

This paper is organized as follows. The BOCA is formulated as a blind source separation 
problem under constraints in Section |llj Section III derives the statistical quantities required 
to define the Bayesian model. The BOCA MCMC algorithm is described step-by-step in 



Section IV This algorithm allows one to generate samples distributed according to the joint 
posterior distribution of the source and mixing matrices. Simulation results conducted on 
synthetic data, as well as a performance comparison with the K-SVD algorithm, are reported 
in Section |Vj Section VI illustrates the interest of the proposed algorithm by solving a sparse 
coding problem with an application to natural image processing. Conclusions and potential 
future works are considered in Section IVIIl 
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II. Problem formulation 

Let x(t) = [xi{t), . . . ,XMit)]^ denote measurement vectors of R^^ observed at time 
instants t = 1, . . . , T by M sensors. These observations are assumed to be related to < M 
unobserved sources denoted s(t) = [si{t), . . . , s^lt)]^ via the matrix ^ in the following 
noisy linear model 

= *s(t) + n(t) (1) 

where n(t) stands for an additive measurement noise. Standard matrix notations yield 

X = *5 + AT (2) 

with X = [x{l), . . . , x{T)], S = [s(l), . . . , s(T)] and N = [n(l), . . . , n(T)]. The M x 1 
noise vectors n(t) {t = 1, . . . ,T) are assumed to be independent and distributed according 
to a centered multivariate Gaussian distribution A/" (Oj\/, cr^Ij\/). 

In this work, the M x N matrix ^ is assumed to be an unknown orthogonal matrix 

rr 1, if 2 = 7 

i^I^j = { (3) 
[ 0, if z ^ J 

where the sources to be recovered can be sparsely represented. Consequently, since only a 
few sources are assumed to be active at time index t, the unobserved vector of sources 
s(t) is sparse and contains only a few components that are non-zero. 

This paper proposes a Bayesian model as well as an MCMC sampling strategy to estimate 
the unknown sources S, the orthogonal matrix ^ and the noise variance a^. 

III. Bayesian model 

The unknown parameter vector associated with the mixing model defined in ([1]) is = 
{S , a"^}. This section gives the likelihood function of the observations and introduces 
prior distributions for the unknown model parameters (assumed to be a priori independent). 

A. Likelihood function 

The Gaussian property of the additive noise yields for each observed vector x(t) 

(4) 



M 
2 



1 9 

'\x{t) - 



where t = 1, . . . ,T and ||-|| stands for the standard £2-norm. By assuming the noise vectors 
n{t) to be a priori independent, the full likelihood function is 



^ ^ i=l 



(5) 
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B. Noise variance prior 



As in numerous works, including [29|, [30|, [36|, a conjugate inverse- Gamma distribution 
is chosen as prior distribution for the noise variance o"^ 

a2|7~J^(^,|) (6) 

where z/ = 2 and 7 is an unknown hyperparameter that will be estimated from the data. The 
main motivation for choosing conjugate prior distribution for is to simplify the computation 
of the posterior distribution of interest. 

C. Prior for the mixing matrix 

The mixing matrix to be estimated is an M x matrix with orthogonal columns whose 
rank is A^. The set of such matrices, denoted Sn,ai, is called the Stiefel manifolc^ (see (37) 
p. 8] for a general introduction of this space). To reflect the absence of any additional prior 
knowledge regarding the mixing matrix, a uniform distribution on this set is chosen as prior 



distribution for * [38, p. 279] 



where 1. (■) stands for the indicator function 



1 if * G 5jv,M, 

(*) = <; (8) 

otherwise. 



In Q, vol {Sn,m) is the volume of the Stiefel manifold Sj^i^m given by [39. p. 70] 



vol {Sn,m) = J. fN\ (9) 



where Tm (■) is the M-variate Gamma function 

_ , , M(M-l) TT _ / i — m \ 

TM{u) = n^^l[Tiu+^^] (10) 

m=l ^ ^ 

and r (■) is the Gamma function 

r+00 

T{u)= I e'\-'dt (11) 

with ti > 0. 







^Note that for the special case M — N, the Stiefel manifold St,t is the orthogonal group O [M) of orthogonal T x T 
matrices. 
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Generating samples according to (|7]) can be easily achieved by first sampling an M x 
matrix V of independent standard normal random variables and then by setting ^ = 



V (V^V) ^ [37|. However, as highlighted by Hoff in [35 1, sampling ^ via its condi 



tional distributions is frequently required, especially within an MCMC estimation framework. 



Therefore, we recall below the procedure proposed in [35| to sample orthogonal matrices ^ 
according to the uniform distribution ^ using the conditional distributions of its columns. 

Firstly, let = ['V'iljg^ denote the matrix formed by the columns of ^ indexed by the 
label vector A C {1,. . . , A^}, where stands for the ith column of ^. Let A/"^ denote 
an orthogonal basis associated with the null space of the orthogonal matrix Then, as 
demonstrated in [35], an orthogonal M x N matrix ^ can be uniformly drawn on the Stiefel 
manifold Sn,m via the following steps 

1) Sample vi uniformly on the unit Af- sphere and set i/;^ = Vi, 

2) Sample V2 uniformly on the unit (M — l)-sphere and set i/?2 = Ni'i^2, 

3) Sample v-^ uniformly on the unit (M — 2)-sphere and set i/'s = -^{i,2}i^3> 

N) Sample vj^ uniformly on the unit (M — + l)-sphere and set ipj^ = A/"!! 7v-i}i^Af- 
Uniform sampling on a sphere required in the scheme detailed above can be easily achieved 
following the normal-deviate method described in [40]. Finally, we have to mention that a 



similar strategy will be used in Section IV to sample mixing matrices ^ according to their 
conditional posterior distributions. 

D. Source prior 

Since the source vectors s(t) are sparse, most of the elements (n = 1,...,A^, 

t = 1, . . . , T) of the matrix S are expected to be equal to zero. Therefore, choosing a "sparse" 
prior for s„(t) is recommended. Coupling a standard probability density function (pdf) with 
an atom at zero is a classical strategy to ensure sparsity. This strategy has been widely used 



for located event detection [20| such as spike train deconvolution [17|, |41|, astrophysical 



frequency detection [[16|, sparse approximations of times-series p2| and reconstruction of 



molecular images [19|. We propose here to take advantage of this approach by choosing a 
BG distribution as prior for The distribution of this BG prior is defined as the following 
mixture 

f {Sn{t)\Xn,al) = {l-Xn)S{Sn{t)) + Xn9al{Sn{t)) (12) 
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where 6 {■) is the Dirac deka function and {sn(t)) is the pdf of the centered Gaussian 
distribution with variance a^. In ([12]), the unknown hyperparameter A„ is the prior probability 
of having an active source. Consequently, this hyperparameter tunes the degree of sparseness 
of the source vector s{t). Note that this probability A„ of having an active source, as well as 
the non-zero component variance a^, have been assumed to be different from one source to 



another to provide a flexible model. This strategy have been previously adopted in [24|, [42|, 



|43|. Another strategy would be to assume that the sources have a non-homogeneous sparsity 



level over times as explained in [44|. By assuming that the source amplitudes s„(t) are a 
priori independent, and introducing the index subsets Irt(e) = {t; s„(t) = e} (e G {0, 1}), the 
full prior distribution for the source matrix S is 



N 



/(5|A,a^) = n 



n=l 
N 

n 

n=l 



(1 - A„) 



m„(0) 



n ^(^-w) 

tex„(o) 



tex„(i) 



27ral 



exp 



(13) 



where A = [Ai, . . . , A„]^, = [aj, . . . , a^]^ and m„(e) = card {X„(e)}. Note that 



rrir. 



Ho- 



where s„ = [s„(l), . . . , s„(T)] and denotes the £o-norm, is the number of active 



components in the source n, whereas m„(0) = T — m„(l) is the number of components that 
are equal to in the source n. 



E. Hyperparameter priors 

A non-informative Jeffreys' prior is elected as prior distribution for the hyperparameter 7 

/(7)cx-1r+(7). (14) 

7 

An inverse-gamma distribution with fixed hyper-hyperparameters «o and ai (in order to 
obtain vague prior with large variance) is chosen as prior for the variance of the non-zero 
components in each source 

~ (ao,ai) • (15) 
A uniform distribution is chosen as prior distribution for the A„ in each source 

A„~W([0,1]). (16) 
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Assuming that the individual hyperparameters are independent the full prior distribution 
for the hyperparameter vector = {7, A, a^} can be expressed as 



1 ^ 

/(<^)cx-1r+ (7)n 

^ n=l 



(17) 



F. Posterior distribution 




Fig. 1. DAG for the parameter priors and hiyperpriors (thie fixed hiyperparameters appear in dashed boxes). 



The posterior distribution of {0, 4>} can be computed from the following hierarchical 
structure 

/(0,0|X)(x/(X|0)/(0|0)/(0) (18) 

where oc means proportional to, 

/(0|(^) = /(5|A,a2)/(*)/(a2|7) (19) 

and where / {X\6) and / (0) have been defined in ([5]) and ( fTTj ). This hierarchical structure 
is represented as a graphical model on the directed acyclic graph (DAG) of Fig. [1} In the 
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joint distribution ([T8|), the nuisance parameter 7 can be easily integrated out, leading to 



TM+v 

1 ^ 



exp 



1 ^ 



t=i 



AT 



X 



n 

71=1 
AT 

n 

n=l 



;i - A.) 



m„(0) 



n ^(^-w) 

tex„{0) 



n 



n=l 



Ar^'^ n 



iex„(i) 



exp 



Sn{tf 



ao + 1 



exp 



1[0,1] (Ar 



(20) 



Inferring the source matrix S and the orthogonal matrix ^ from ( [20] ) is not straightfor- 
ward, mainly due to the combinatory problem induced by the quantities ni(t) and no(t). 
In particular, closed-form expressions of the Bayesian estimators of S and ^ are difficult 
to obtain. We propose to use MCMC methods to generate samples that are asymptotically 



distributed according to the target distribution ( [20| ). These generated samples are then used 
to approximate the Bayesian estimators of S and ^ . 

IV. Partially collapsed Gibbs sampler for orthogonal component analysis 

OF sparse sources 

We describe in this section an MCMC method that allows one to generate a sample 
collection 



3^ 



S ,a 



2(h) . 



(h) 



'2(h) 



h=l,...,NMC 



asymptotically distributed according to the posterior distribution ( [20[ ). The interested reader 
is invited to consult [ |28| for more details about MCMC methods. 

The easiest way to sample according to this posterior would consist of using a standard 
Gibbs sampler whose main steps are 

1) sample S from / a^, A, a^, X), 

2) sample * from / (x^, A, a^, X) = /(*|5,a2,X) 

3) sample a"^ from / (a^jS, A, a^, X) = / (a^jS, X), 

4) sample A from / (A|5, a^, a^, X) = f (A|5), 

5) sample a^ from / {a?\S, a^, A, X) = / {a?\S). 

However, as highlighted in previous works, sampling BG processes following the crude Gibbs 
sampler detailed above [step 1)] often leads to poor mixing properties and weak estimation 



performance [16|. As an alternative a new MCMC algorithm for BG deconvolution was 
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recently studied [32|. The approach relies on explicitly introducing binary variables Q that 
indicate the presence of non-zero BG components. Then, these indicators are sampled after 
marginalizing over the BG variable amplitudes. This strategy casts the resulting MCMC 
algorithm as a partially collapsed Gibbs (PCG) sampler. Van Dyk and Park have described 



in [34 1 and [33| how PCG samplers can be efficient tools to overcome drawbacks inherent 
to standard Gibbs sampler, e.g., slow convergence. As detailed in the works cited above, 
PCG samplers consists of replacing some of the conditional distributions with marginalized 
conditional distribution. The resulting PCG sampling scheme can be summarized by the 
following steps 

1) sample Q from / a^, A, a?, X), 

2) sample S from / (5|Q, (x^, A, a^, X), 

3) sample * from / (x^ X) 

4) sample a"^ from / (cr^|5, X), 

5) sample A from / (A|5), 

6) sample from / (a^|5). 

Note that the source amplitudes have been marginalized to provide the discrete distribution 



appearing in step 1). The main steps of the PCG sampler are detailed in subsections IV-A 
IV-E (see also the step-by- step AlgojT]). 



to 



A. Sampling the indicator and source matrices 

The prior independence assumption of the source vectors allows one to rewrite the joint 
posterior distribution of the source matrix S as 



T 



f (5|A, a^, a^ X) = J] / {s{t)\X, a^, a^ x{t)) . (21) 



t=i 



Consequently, sampling according to / {S\X, a^, cr^, ^, X) can be achieved by successively 
sampling the source vectors s(t) according to / (s(t)|A, a^, cr^, x(t)) for t = 1, . . . , T. 



It is important to note here that standard derivations similar to those in [17| allow one to 
state that the conditional posterior distribution of the tth component s„(t) of the nth source 
is a BG distribution. However, as pointed out in [fT6|, sampling according to this distribution 
needs to explore the state space efficiently, which can be difficult mainly due to the difficulty 
of the Gibbs sampler to escape from local maxima. Recently, Ge et al have introduced a 
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Algorithm 1 Gibbs sampling for orthogonal component analysis of sparse sources 

1: % sampling the sources 

2: for t = 1 to T do 

% sampling the indicators recursively following Algo. [2] 

for = 1 to do 

sample the indicator g„(t) following the probability ( [23] ), 
end for 



sample the source vector s{t) from the pdf's in (|24]) and ( |25| ) 

8: end for 

9: % sampling the orthogonal mixing matrix 

10: for n = 1 to do 

11: compute the basis Nn of the null space of ^ n, 

12 
13 



sample v„ from the von Mises-Fisher distribution in (28) 



set V„ 



14: end for 



NnVn, 



15: % sampling the noise variance 



16: sample parameter from the pdf in ( [29] ) 

17: % sampling the probability of having active sources 

18: for n = 1 to do 



19: sample the hyperparameter A„ from the pdf in (31 1 



20: % sampling the active source variances 



21: sample the hyperparameter from the pdf in ( [32] ), 
22: end for 



performing MCMC algorithm to overcome this issue by explicitly introducing an auxiliary 



binary variable g„(t) that indicates the active sources [32| 

qn{t) = < (22) 
I 0, otherwise. 

Conditionally upon this indicator variable g„(t), the prior of the source component s„(t) in 
( fT2l ) can be easily rewritten 

f {Sn{t)\qn{t) = 0) = 6 {Sn{t)) , 
f {Sn{t)\qn{t) = I, a) =ga2 {Sn{t)). 
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The probability of having an active component in source n is governed by an unknown 
hyperparameter such that 



P[g„(t) = l] = A, 
P [qn{t) = 0] = 1 - A, 



In [32|, Ge et al. have proposed to sample the source vectors s{t) (t = 1, . . . ,T) and the 
indicators q(t) = [qi(t), . . . , qN(t)]^ using the following 2 steps 

1) Sampling according to f {q{t)\X,SL^,a'^,'^,x(t)), 

2) Sampling according to / {s{t)\q{t), A, a^, cr^, ^, x{t)). 

As mentioned above, these two steps make the resulting Gibbs sampler a PCG sampler and 
are detailed below. 

1 ) Sampling according to f {q{t)\X, a^, cr^, x{t)): Sampling according to / {q{t)\X, a^, cr^, x{t)) 
can be performed by updating the components g„(t) (n = 1,...,N) successively. As 



noticed in [32|, the posterior probability of having the source component g„(t) to be active 
given the other components denoted = [qi{t), . . . , g„_i()f:), g„+i(t), . . . , gAr(t)]^ is 

Mo - Ml" 



P [qn{t) = l\q_^{t),Xr.,al^,a\x{t)] 
with 



1 + exp 



n -1 



(23) 



u, = x{tfB;^x{t) + \og\B,\+2e\og ( - 1 

V An 

B, = al^dmg {qi^t)} + a^M, 

Q[e] (t) = [qi{t),---, qn-i{t) , e, qn+i (t),. . . ,qN{t)f ■ 

The probability in ([23]) can be efficiently computed following the recursive scheme initially 
introduced in pT| (and used in p2|), and adapted here to take into account the orthogonality 
property of ^. This numerical implementation relies on the Cholesky decomposition of B^ 
and the matrix inversion lemma. This avoids to calculate the compute-intensive inversion of 
B^ and the determinant \B^\ at each step of the Gibbs sampler. We describe in Algo. |2]how 
the component g„(t) is updated. 

2) Sampling according to f {s{t)\q{t),SL'^,a'^,^,x{t)): Conditionally upon the indicator 
variable qn{t), the distribution of s„(t) is defined by 

/(sn(^)|g„(^)=0,a^*,a^(^)) =5(s„(t)) (24) 
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Algorithm 2 Recursive sampling of indicator vector q{t) 

1: for n = 1 to do 

2: set 5n = 

3: set G = ['lpi]g^^t)=V 
a? 

4: set = ^, 

5: set r„ = 5„ + /i„ \\iPJ^ - ^^^GG^^^, 

6: set = a;(t)^V'n - T|^a;(t)^GG^V'n, 

7: set Am = log(5nr„,) - + 25„ log - l) , 

8: sample w ~ W[o,i]' 

9: if M > [1 + exp (-^)] then 

10: set = qn{t) + 

11: end if 

12: end for 



and 

[sn(^)],„W=i a^ a^, ~ A/" (AiGa^(t), A2) (25) 

where [snit)]q^(^t)=i stands for L x 1 vector composed of the active components in the source 
vector s{t), L = Wq^t)]]^, G = [V'n]g„(t)=i the M x L matrix composed of the columns 
of ^ corresponding to the active source components and Ai = diag < > and 

Ao = diag < — !--r r are L x L diagonal matrices with u„ = %. Note that this block 



sampling strategy, also adopted in [ |43| , avoids to sample the non-zero source components 
one-by-one. 



B. Sampling the mixing matrix 

The conditional distribution f {'^\S , a"^ , X) being intractable, this section describes how 

~(m) 

Gibbs moves can be used to generate samples according to the posterior distribution 

of each column of ^ conditionally upon the others. Let = [lAilj^n (rcsp. s_n{t)) denote 
the matrix ^ (resp. vector s{t)) whose nth column (resp. component) has been removed. By 
denoting 

Mn if) = Snit) [X{t) - *_„S_„(t)] (26) 
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Straightforward computations yield 

/ (V'nl*-™, S, a^ X) oc exp 



1 ^ 



1^ f*) 



(27) 



As detailed in [35|, conditionally upon ^'_„, Vn ^^^^^ be written = NnVn where Vn is 



uniform on the sphere and Nn is a basis for the null space of ^ a- Therefore, from ( [27] ), 
the conditional distribution of Vn is 



/ {vn\^-n, S, X) OC exp 



1 ^ 



(28) 



which is a von Mises-Fisher distribution with parameter ^ Ylt /-''nt-^n- A standard method 



to sample according to this distribution is given in [45 1. To summarize, the columns of ^ 
can be iteratively sampled conditionally upon the others by drawing samples v„ from a von 
Mises-Fisher distribution and setting tp,^ = NnV^. 

C. Sampling the noise variance 



Looking carefully at ( [20| ), the conditional distribution of the noise variance is an inverse- 
gamma distribution such that 

'TM 1 



2 ' 2 



t=i J 



(29) 



D. Sampling the probability of having an active source 

The conditional distribution of the hyperparameter A„ (i.e., the probability of the source 
n to be active) can be computed from ( [20] ) 

/(AJs„)oc(l-A„)"^"(°Ur^'^ (30) 



where mn{e), e G {0, 1}, has been defined in paragraph III-D Therefore, sampling according 
to /(A„|s„) is achieved as follows 



Aft I ~ Be (m„(0) + 1, m„(l) + 1) . 
where Be (a, b) is a Beta distribution with shape parameters a and b. 



(31) 



E. Sampling the variance of the active sources 

Straightforward computations leads to the following IG distribution as conditional posterior 
the variance of non-zero BG components in the source n 



1 



a^\Sn ~ ig [ ^mn{l) + ao, ^ \\s^\f + oli 



(32) 
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F. Inferring the sources and the mixing matrix 

The main objective of the proposed Bayesian algorithm is to estimate the source matrix 
S and the mixing matrix ^ from the data, independently from the nuisance parameters a^, 



A„ and a^. The MCMC algorithm detailed in paragraphes IV-A to IV-E generates samples 
asymptotically distributed according to the posterior distribution ([20]). Consequently, the 
MMSE estimators of S and ^ can be approximated by empirical averages over the A^mc— ^bi 
drawn samples as follows 

1 ^"^"^ ~{h) 

'S'mmse = E ^ — — y S (33) 

'*=^bi+i 

*MMSE = E ^ — V (34) 

iVMC - ^bi ^^^^^ 

where A^bi denotes the number of burn-in iterations of the sampler and A^mc is the total 
number of Monte Carlo iterations. 

Another important property of the proposed MCMC algorithm is that the generated pairs 



) 

form also a Markov chain whose stationary distribution is the joint marginal distribution 
f {S ^'*^\X). Therefore, the joint MAP estimator of (>S', ^') can be computed by retaining 
among the collection 



h=N^,i,...,NMC 

the sample that maximizes the marginalized distribution / {S, ^\X) [46 p. 165] 

(5map, *MAp) = argmax *|X) 

^ argmax / {S, ^\X) . 

Note that this joint marginal distribution f (S ,^\X) can be easily computed from the 
hierarchical structure ( [T8] ) that allows one to integrate out the hyperparameter vector 4> and 
the noise variance in the full posterior distribution ( |20l ), yielding 



f (5, .,X) oc n:^(l.^.(l),l.^.(0)) ^ ^(-°^ -^4^ ,36) 



where 5 (a, 6) = M. 
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V. Simulation results 

A. Performance analysis 

This section first considers a toy example to provide comprehensive and extensive results. 



We generate N = 2 sources of length T = 100 according to the prior distribution in ( |T2| ) 
with the probabilities of having active sources Ai = 0.05, A2 = 0.1 and the active source 
variances af = 100 and = 10. These 2 sources, represented in Fig. [I] (red), are mixed to 
obtain T = 100 observation vectors x(t) E ]R^° of dimension M = 50. 
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Fig. 2. Actual sources (circles, red) and corresponding MAP estimates (stars, blue). 



The generated orthogonal mixing matrix ^ is composed of = 2 basis vectors propor- 
tional to sinusoidal functions, ^pm,n oc cos {2Tcfnm + vr) (m = 1, . . . , M) with two different 
frequencies /i = 0.02 and /2 = 0.04. These two vectors are represented in Fig. [2] (red). 

The T = 100 observation vectors are corrupted by an additive white Gaussian noise with 
variance cr^ = 1.3 x 10"^, corresponding to a signal-to-noise ratio (SNR) SNRdB = 15dB 
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Fig. 3. Actual basis vectors (red) and corresponding MAP estimates (blue). 



where 



SNRaB = 101og,o|^£^|^^|. (37) 



The proposed Gibbs algorithm is applied on these noisy observations with A^mc = 1000 
iterations including A^bi = 100 burn-in iterations. Note that these numbers of iterations 
have been chosen to ensure convergence of the Markov chains. More precisely, as a first 
convergence assessment, the outputs of the Markov chains have been monitored for different 
parameters of interest. As examples, the source parameters A„ and (n = 1,2) generated 
by the proposed BOCA algorithm have been depicted as functions of the iteration number 
in Fig. |4] (top and middle, respectively). Note that the generated values converge towards 
the actual values of the corresponding parameters after very few iterations. As an additional 
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convergence criterion, the reconstruction error 

e{h) 



\ t=i 



(38) 

(h) 



has been evaluated as a function of the iteration number h (h = N^i + l, . . . , Nuc) where ^ 
and s^'^^t) are the MMSE estimates of ^ and s(t) computed following ( [33] ) after h iterations, 
respectively. The results depicted in Fig. |4] (bottom) show that A^mc = 1000 iterations are 
sufficient to ensure a small reconstruction error. The computation time required by 1000 
MCMC iterations is 17s for an unoptimized MATLAB 2007b 32-bit implementation on a 
2.2-GHz Intel Core 2. Of course, for more challenging problems, more MCMC iterations 
may be required. 
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Fig. 4. Top (resp. middle): source parameters A„ (resp. a^) generated by tlie proposed Gibbs sampler for the 1st (blue) 
and 2nd (green) sources. Bottom: error reconstruction as a function of iteration number. 

The obtained joint MAP estimates of the sources and mixing matrices are represented in 
Fig. |2] (stars, blue) and Fig. [3] (blue), respectively. These results show that the proposed BOCA 
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algorithm allows one to estimate the sources and basis vectors for this simple example. Note 
that the first active component in the second source signal has not been detected due to its 
very low amplitude. 

The proposed algorithm implicitly generates binary variables g„(t) that indicate the pres- 



ence/absence of non-zero source components (see paragraph |IV-A[ ). Thus, these indicators 
can be used to compute interesting statistics regarding the probability of having active 
components. As an example, the number Kn (n = 1, . . . , N) of active components in a 
given source s(n) can be estimated by 



(39) 



i=l 



The posterior probabilities of Ki and K2 estimated by the proposed method for the considered 
synthetic example are represented as histograms in Fig. |5] The actual numbers of active 
components Ki = 5 and K2 = 12 are represented by red dotted lines in these figures. 




Fig. 5. Estimated histograms of numbers of active components in the sources. The actual numbers appear in red dotted 
line. 
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The binary variables g„(t) have Bernoulli distributions. Therefore, the MMSE estimator 



of g„(t) provides the posterior probability of to be active. Following ( [33] ), the MMSE 
estimates of the indicators Qnit) are computed and represented in Fig. |6j These posterior 
probabilities are in good agreement with the actual positions of the active components, 
represented by red dotted lines in these figures. Note that these probabilities allow one to 
locate the first active component in the second source signal that has been previously omitted 
by the MAP estimator. 
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Fig. 6. Posterior probabilities of having active components. The actual active non-zero components appear in red dotted 
line. 



B. Performance comparison 

We propose here to compare the proposed algorithm with an up-to-date dictionary learning 
technique referred to as K-SVD algorithm [9]. This algorithm has been widely applied for 
various signal and image processing problems and has demonstrated promising results (see for 
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example [|47|, [|48|, and ||49|). For fixed signal dimensions M = 128 and T = 256, an M x T 



sparse matrix S is randomly drawn according to the prior in ( [T2| ) with Ai = . . . = Aat = 0.05 
and al = . . . = a% = 10. Then, an M x random orthogonal matrix ^ is selected as the first 
columns of the left orthogonal matrix provided by the singular- value-decomposition (SVD) 
of an M X M matrix whose elements have been drawn according to a A/'(0, 1) distribution. 
The number of dictionary atoms is set to three different values: = 4, 8 and 16. The 
observations, computed following (|2]), are corrupted by an additive Gaussian noise with SNR 
ranging from OdB to 20dB. The proposed Bayesian algorithm is applied on the generated 
data with A^mc = 300 Monte Carlo iterations and A^bi = 50 burn-in iterations. The accuracy 
of the MAP estimates of the source and mixing matrices Smap and ^'map are compared with 
the results provided by the K-SVD algorithm with a total number of 80 iterations (as in 
[j9|). The performance of these algorithms are expressed in terms of reconstruction error and 
sparsity. More precisely, the root mean square error (RMSE) between the actual noise-free 
data and the estimated reconstructed data ^'S is computed as 



RMSE 



\ 



MT 
t=i 



T ^ 

(40) 



\-Y,Us{t)-^s{t) 



The sparsity of the estimated source vectors is measured using the following score function 

k 

V ^ (41) 



NT 

n=l t=l 

Note that K E [0, 1] where K = 1 means that S is the N xT matrix of zeros whereas K = 
means that S is a matrix containing NT non-zero elements. The RMSEs and score functions 
k, computed over 100 Monte Carlo trials, are depicted in Fig.'s [v] and [s] as functions of the 
noise level SNR and for different values of the number of sources A^. 

Fig. |7] indicates that the Bayesian orthogonal component analysis (BOCA) generally out- 
performs the K-SVD algorithm in term of reconstruction RMSE. In other words, the proposed 
strategy provides a combination of source and mixing matrices S and ^ that better fit the 
observed data than the solution provided by K-SVD, especially at low SNR. Moreover, Fig. [8] 
shows that when the number of sources and the SNR are low, the sources estimated by BOCA 
are much sparser than the sources identified by K-SVD. Note that the sparsity level of the 
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Fig. 7. RMSEs as functions of the noise level SNR for different values of the number of sources A^. 

solutions provided by K-SVD is implicitly fixed by the number of non-zero entries in the 
orthogonal-matching-pursuit (OMP) sub-procedure, whereas the introduced BOCA estimates 
this sparsity degree via an unsupervised framework. 



VI. Application to sparse coding with under-complete orthogonal 

DICTIONARIES 

In this section, we present the ability of the proposed procedure to perform sparse coding 
with under-complete orthogonal dictionaries. A fraction of the well-known Barbara natural 
image is analyzed by BOCA. This 256 x 256-pixel image, depicted in Fig. |9] (column #1), 
is decomposed into T = 16^ block patches of size M = 16 x 16 pixels. The proposed 
Bayesian strategy and the K-SVD algorithm are applied on these observations for different 
values of the number of sources (i.e., different numbers of dictionary atoms). The images 
reconstructed by the algorithms after estimating the source and mixing matrices are depicted 



January 4, 2010 



DRAFT 



24 



0.9 - 




'(0 

a. 

0.4- 



0.3 



0.2 



0.1 



N=4 / BOCA 
N=4 / K-SVD 
N=8 / BOCA 
N=8 / K-SVD 
N=16/B0CA 
N=16/ K-SVD 



10 

SNR{dB) 



20 



Fig. 8. Sparsity as function of the noise level SNR for different values of the number of sources A'^. 



in Fig. |9] (column #2-5) for different values of ranging from = 4 to = 32. As 
in paragraph |V-B[ the estimation performances of both algorithms are evaluated in terms of 
reconstruction error (RMSE) and sparsity level (K expressed as a percentage). The RMSEs 
are reported for each reconstructed image and the sparsity measures appear between brackets. 
These results clearly indicate the reliability of BOCA for fitting the observed data and its 
ability of identifying a sparse representation. 

( ~{h)^ 

The Bayesian algorithm generates a collection of mixing matrices <^ > that 

I J h=l,...,NMC 

can be used to approximate the MAP estimator of ^ following ( [35] ). The MAP estimate of 
the dictionary atoms (i.e., the mixing matrix), formatted as block patches of size 16 x 16, 



are represented in Fig. 10 (left) for the corresponding values of the number of sources A^. As 



an illustration, the dictionary atoms estimated by K-SVD have been also depicted in Fig. [TO 
(right). 
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RMSE = 20.5(12%) RMSE = 16.9 (28%) RMSE = 13.2 (16%) RMSE = 9 (23%) 
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Fig. 9. Sparse coding of the Barbara image obtained by BOCA (1st row) and K-SVD (2nd row) with different values of 
A'^. Corresponding RMSEs and sparsity levels (expressed as percentage) are indicated above each image. 



VII. Conclusions and future work 

We introduced in this paper a new Bayesian algorithm for sparse representation with under- 
complete orthogonal dictionary. This problem was formulated as a blind separation problem of 
sparse sources mixed by an orthogonal matrix. The proposed approach relied on appropriate 
prior distributions for the unknown model parameters. The sparse sources to be estimated 
were modeled as Bernoulli-Gaussian processes. A uniform distribution on the Stiefel manifold 
was elected as prior distribution for the mixing matrix. The hyperparameters associated with 
this prior model were estimated from the data in an unsupervised fully Bayesian framework. 
A partially collapsed Gibbs sampler was studied to generate samples distributed according to 
the joint posterior distribution of the mixing matrix, source matrix, the noise variance and the 
model hyperparameters. The Bayesian estimators of the unknown model parameters were then 
approximated by using the generated samples. The estimation performance of the proposed 
algorithm was evaluated from simulations conducted on synthetic data. A comparison with the 
K-SVD algorithm showed very promising results in favor of the proposed Bayesian method. 
An application of the proposed sparse coding technique for natural image processing was 
also investigated. 

Future works include the unsupervised estimation of the number of sources N (i.e.. 
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Fig. 10. Estimated dictionary atoms by BOCA (left) and K-SVD (right) for different values of the number of dictionary 
atoms A'^. 



dimension of the subspace) using a reversible-jump MCMC algorithm as in [35 1. Extension 
of the proposed linear decomposition model to union of orthogonal dictionaries as in [[8| 
is currently under investigation. Finally, it would be interesting to apply BOCA to sparse 
coding in transformed domains for compression problems. 
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